1  Introduction

This project was based on the premise that making systems difficult to infiltrate, detecting infiltrations
as soon as possible, and expelling attackers when detected may not be enough, and in some cases may
actually be detrimental. A fundamental problem with this approach is that it encourages attackers to try
again, and in subsequent attempts, they are likely to have learned more about the defender’s
defenses than the defender has learned about the attacker. Consequently, each successive attack is more likely to
achieve  the  attacker’s  objectives.  From  this  perspective,  a  cyber  defense  strategy  should  not  only  keep 
attackers  out,  but  should  also  enable  a  defender  to  learn  about  an  attacker’s  methods  and  intentions  faster 
than  the  attacker  can  learn  about  the  defender. 

Our  work  during  the  beginning  of  the  project  focused  on  assessing  the  potential  cost  of  always  expelling 
attackers  versus  following  an  optimized  policy  that  considered  the  value  of  keeping  attackers  in  the  system 
to  learn  about  their  motives.  This  work  is  described  in  Section  2  of  the  report.  At  this  stage,  the  work 
presumed  that  the  defender  could  learn  about  the  attacker  by  observing  it  in  the  system,  but  did  not 
study in detail how that learning would occur. Later in the project, the work focused on this learning,
particularly the problem of classifying attackers in situations in which the attacker is adjusting their
strategy to make classification more difficult. This part of the investigation is described in Section 3.


2  Expelling  Attackers 

During the course of an attack, the defender may choose either to expel the attacker upon detecting the
attacker’s presence, or to keep him in the system in order to observe and learn about him. If the defender
can “out-learn” the attacker, i.e., learn about the attacker faster than the attacker learns about the defender,
then with the help of that intelligence the defender may be able to thwart the attacker’s infiltration entirely and
ensure the security of the system against this attacker in the long run.

2.1  Model 

We use a simple discrete-time MDP to model the system. Our model proceeds in discrete time slots
indexed by $k$. At any time $k$, the state of the system is described by four state variables. The first,
taking values in $\{C, NC\}$, describes whether the attacker is “Connected” to the secured information system
or “Not Connected”. Similarly, the second, taking values in $\{D, ND\}$, indicates whether the defender has
“Detected” or “Not Detected” the attacker’s connection. The other two variables, $x_k, y_k \in [0, 1]$, represent the
knowledge that the attacker and defender have at time $k$, respectively.

The system evolves according to the rules described below, and the connection and detection aspects are
illustrated in Figure 1.

•  Each period that the attacker is disconnected from the defender’s system, he may (re-)connect with
probability $p_{\mathrm{connect}} = \epsilon$.

•  Each period that the attacker is connected but not yet detected, the defender may detect the
attacker’s presence with probability $p_{\mathrm{detect}} = \delta$. This probability reflects the capability of the Intrusion
Detection System (IDS) deployed by the defender.

Figure 1: The dynamics of the connection and detection processes.

•  After connection (in state $(C, \cdot)$), the attacker achieves his objective with probability
$p_{\mathrm{success}}(k) = \gamma x_k (1 - y_k)$, where $\gamma \in (0, 1]$ is a scalar. Note that $p_{\mathrm{success}}$ increases as the attacker’s
knowledge $x_k$ grows, and decreases as the defender’s knowledge $y_k$ grows. We refer to this formulation as the single
goal formulation. In an alternative formulation, the attacker may have multiple goals during the course of
an attack, and each goal yields 1 unit of reward to the attacker; the quantity $\gamma x_k (1 - y_k)$ is then the
attacker’s expected reward in each period. We call this the multiple goals formulation.

•  During periods that the attacker is connected (in state $(C, \cdot)$), he can gather information about the
defender’s system. The defender may also learn about the attacker during periods in which the defender
knows of the attacker’s presence (in state $(C, D)$). To keep the model tractable, we use two simple
types of learning curves to model the growth in knowledge: geometric and linear. In the geometric case,
$x_k$ and $y_k$ evolve according to the following recursions during a learning period:

$$x_k = x_{k-1} + \alpha (1 - x_{k-1}),$$
$$y_k = y_{k-1} + \beta (1 - y_{k-1}),$$

where $\alpha, \beta \in (0, 1)$ are learning parameters representing the speed of learning of the
attacker and defender, respectively. We also consider the linear case, which makes the state space finite
and thereby facilitates the analysis of the corresponding MDP problem:

$$x_k = \begin{cases} \alpha k & \text{if } 0 \le k \le \lfloor 1/\alpha \rfloor \\ 1 & \text{otherwise} \end{cases}$$

$$y_k = \begin{cases} \beta k & \text{if } 0 \le k \le \lfloor 1/\beta \rfloor \\ 1 & \text{otherwise} \end{cases}$$

where $\alpha, \beta \in (0, 1)$ are the slopes of the corresponding learning curves.

•  Every period that the attacker is connected and detected, the defender has a control decision: expel
the attacker or not. Expelling the attacker drives the system to state $(NC, ND)$; otherwise, the
system stays at $(C, D)$ for one period. This is the only state in which the defender has the opportunity
to apply control, and the attacker has no control choice in this model.
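As a short worked step (not part of the original analysis), the geometric recursion can be unrolled into closed form. Here $n$ counts learning periods only, and $x_0$ is the knowledge at the start of learning; both are notational assumptions made for this illustration.

```latex
% Subtract both sides of x_n = x_{n-1} + \alpha (1 - x_{n-1}) from 1:
1 - x_n = (1 - \alpha)\,(1 - x_{n-1})
\quad\Longrightarrow\quad
x_n = 1 - (1 - \alpha)^{n}\,(1 - x_0).
% Example with x_0 = 0 and \alpha = 0.02:
x_{100} = 1 - 0.98^{100} \approx 0.867 .
```

The same expression with $\beta$ in place of $\alpha$ holds for $y_n$. It shows that geometric knowledge approaches 1 but never reaches it in finite time, whereas the linear curve saturates after about $1/\alpha$ learning periods, which is why the linear case yields a finite state space.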
Figure 2: The performance of Always-Expel policy.


2.2  Analysis 

We study this model by formulating it as a Markov Decision Process (MDP). The details are given in
[1]. One observation used in the analysis is that when the defender’s knowledge $y_k$ reaches 1, the
evolution of the model effectively ends, since the attacker cannot gain anything more from the system.
Thus the model with the linear learning curves should stop evolving after a finite time. This allows us to
model it as a Stochastic Shortest Path (SSP) problem [2]. Another observation is that if an attacker has
some probability of not returning after an expulsion, the problem can be modeled as a discounted MDP, since future
costs are discounted by the chance that the attacker will have given up by that future time. Conversely, a
persistent attacker is modeled with an undiscounted MDP. For the persistent attacker, we can use these
observations to show some monotonicity properties of the defender’s value function, which in turn lead to
the following result.

Prop.  1  For  the  undiscounted  MDP  (persistent  attacker)  with  linear  learning  curves,  the  Never-Expel 
policy  is  optimal  and  dominates  any  other  policy. 

Some further technical arguments extend the above proposition to the geometric learning curves as
well. We also consider the following embellishment to the original model.

•  Boosting Factor Upon Expulsion: A simple embellishment of the original model is to introduce a
knowledge boosting factor, $f$. When the defender chooses to expel the attacker, the attacker’s
knowledge grows according to $x_k = x_{k-1} + f\alpha(1 - x_{k-1})$ (geometric case) or $x_k = x_{k-1} + f\alpha$ (linear
case), where $f > 1$ and $k$ is the time index. This expression reflects the possibility that the attacker
learns faster in an expulsion period than in a period in which he stays connected, because he may learn
something about the reason for his failure (being detected) and so improve his tactics next time.
This embellishment does not affect the results derived in this section because it only makes expulsion
less attractive.

2.3  Simulation  Results 

Without formulating the model as an MDP, one can simulate the evolution of the attack-defense process
under different defender policies. To begin with, it is interesting to compare two extreme policies,
Always-Expel and Never-Expel. Figures 2 and 3 display the result of a typical sample path for the geometric
case. The performance measure is the cumulative probability of attacker success (the single goal
formulation), and the parameter choice is $\alpha = .02$, $\beta = .05$, $\gamma = .01$, $\epsilon = .05$, $\delta = .05$. We assume that at
the initial state ($k = 0$) both the attacker and the defender have zero knowledge and the attacker is not connected
(state $(NC, ND)$), and we simulate the system from time $k = 0$ to 1000. The left plot of Figure 2 shows that
the attacker’s knowledge grows faster than that of the defender under the Always-Expel policy, and the
middle plot indicates that the cumulative probability of attacker success exceeds 90%. Similar plots are
provided in Figure 3 for the Never-Expel policy. Comparing the left plots of Figures 2 and 3, one can
see that the defender’s knowledge grows faster under Never-Expel and that the defender successfully out-learns the
attacker. Consequently, the cumulative probability of attacker success is under 12%, close to an order of magnitude
better than under the rather naive Always-Expel policy. The right plots in both figures show the
evolution of the per-period probability of attacker success. The rapid defender learning
under the Never-Expel policy produces a drastic drop in the per-period attacker success probability,
which further explains the advantage of the out-learning strategy.

Figure 3: The performance of Never-Expel policy. (Horizontal axis in all three panels: time $k$.)
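The simulation described above can be sketched as follows. This is a minimal illustration, not the code used to produce Figures 2 and 3. The function name `simulate`, the policy labels, and the within-period ordering are assumptions: the attacker is taken to learn in every connected period, and the defender to learn in each detected period before any expulsion takes effect. The cumulative success probability is computed as one minus the product of per-period failure probabilities.

```python
import random

def simulate(policy, alpha=0.02, beta=0.05, gamma=0.01, eps=0.05, delta=0.05,
             horizon=1000, seed=0):
    """Simulate one sample path with geometric learning curves.

    policy is 'always_expel' or 'never_expel' (hypothetical labels).
    Returns per-period lists: attacker knowledge, defender knowledge,
    cumulative probability of attacker success.
    """
    rng = random.Random(seed)
    connected, detected = False, False   # start in state (NC, ND)
    x, y = 0.0, 0.0                      # attacker / defender knowledge
    survive = 1.0                        # P[no attacker success so far]
    xs, ys, cum = [], [], []
    for _ in range(horizon):
        if not connected:
            # Disconnected: (re-)connect with probability eps.
            connected = rng.random() < eps
        else:
            # Connected: attacker may succeed, then learns (assumed ordering).
            survive *= 1.0 - gamma * x * (1.0 - y)
            x += alpha * (1.0 - x)
            if detected:
                # State (C, D): defender learns, then applies the policy.
                y += beta * (1.0 - y)
                if policy == "always_expel":
                    connected, detected = False, False
            else:
                # State (C, ND): detection occurs with probability delta.
                detected = rng.random() < delta
        xs.append(x)
        ys.append(y)
        cum.append(1.0 - survive)
    return xs, ys, cum

if __name__ == "__main__":
    for name in ("always_expel", "never_expel"):
        xs, ys, cum = simulate(name, seed=1)
        print(name, round(xs[-1], 3), round(ys[-1], 3), round(cum[-1], 3))
```

Individual sample paths vary with the seed, so the printed values need not match the percentages quoted for the figures.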

2.3.1  Structure  of  the  Optimal  Policy 

By way of policy iteration, it is easy to compute the optimal stationary policy, i.e., a policy whose decisions
depend only on the state. Figure 4 illustrates the optimal policy of the discounted MDP under the parameter
choice $p = .89$, $\alpha = .09$, $\beta = .13$, $\gamma = .05$, $\epsilon = .05$, $\delta = .05$. The optimal policy is displayed in
control-matrix form, where each entry corresponds to the optimal decision of the defender in state $(x, y, 3)$. From
experiments with various parameter settings, we observe that the optimal policy of the discounted MDP
always possesses a lower-triangular, threshold-like structure. Roughly, for a fixed amount of defender knowledge,
the defender should switch from Not-Expel to Expel as the attacker’s knowledge grows. This is similar
to the idea of a threshold policy, in which the optimal control switches from one action to another as the state exceeds
some threshold. This is quite intuitive: for fixed defender knowledge, the more the attacker
knows about the defender’s system, the more immediate damage he may inflict. Consequently, it may be
preferable to expel the attacker and avoid relatively significant immediate costs for a while; moreover,
since there is some positive probability that the attacker gives up in each period, postponing the attack
by expulsion is more advantageous than the out-learning strategy. On the other hand, as the defender’s
knowledge grows, there are fewer Expel entries in the optimal control matrix. Again, this can be
understood from the fact that the immediate cost is decreasing in the defender’s knowledge.

Figure 4: The optimal policy in control-matrix form.
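To make the computation concrete, the sketch below solves a discounted version of the model with linear learning curves. It uses value iteration rather than the policy iteration mentioned above, and it rests on several assumptions that the text does not spell out (the details are in [1]): the per-period cost in any connected state is taken to be $\gamma x (1 - y)$, both players’ knowledge advances by one level in a $(C, D)$ period before the expel decision takes effect, and the discount factor `rho` stands for the .89 parameter. The function name `solve_expel_mdp` is hypothetical.

```python
def solve_expel_mdp(alpha=0.09, beta=0.13, gamma=0.05, eps=0.05, delta=0.05,
                    rho=0.89, tol=1e-10):
    """Value iteration for an assumed discounted expel/not-expel MDP."""
    # Knowledge levels on the linear learning curves, capped at 1.
    x = [min(1.0, i * alpha) for i in range(int(1 / alpha) + 2)]
    y = [min(1.0, j * beta) for j in range(int(1 / beta) + 2)]
    nx, ny = len(x), len(y)
    NC, C_ND, C_D = 0, 1, 2  # modes: not connected, undetected, detected
    V = [[[0.0] * ny for _ in range(nx)] for _ in range(3)]
    while True:
        W = [[[0.0] * ny for _ in range(nx)] for _ in range(3)]
        diff = 0.0
        for i in range(nx):
            i1 = min(i + 1, nx - 1)          # attacker learns one level
            for j in range(ny):
                j1 = min(j + 1, ny - 1)      # defender learns one level
                cost = gamma * x[i] * (1.0 - y[j])   # assumed stage cost
                W[NC][i][j] = rho * (eps * V[C_ND][i][j]
                                     + (1 - eps) * V[NC][i][j])
                W[C_ND][i][j] = cost + rho * (delta * V[C_D][i1][j]
                                              + (1 - delta) * V[C_ND][i1][j])
                # Defender picks the cheaper continuation: expel or keep.
                W[C_D][i][j] = cost + rho * min(V[NC][i1][j1], V[C_D][i1][j1])
                for m in range(3):
                    diff = max(diff, abs(W[m][i][j] - V[m][i][j]))
        V = W
        if diff < tol:
            break
    # expel[i][j] is True when expelling is strictly cheaper in (x_i, y_j, C_D).
    expel = [[V[NC][min(i + 1, nx - 1)][min(j + 1, ny - 1)]
              < V[C_D][min(i + 1, nx - 1)][min(j + 1, ny - 1)]
              for j in range(ny)] for i in range(nx)]
    return x, y, V, expel

if __name__ == "__main__":
    x, y, V, expel = solve_expel_mdp()
    for row in expel:
        print("".join("E" if e else "." for e in row))
```

Printing the `expel` matrix gives a control-matrix view in the spirit of Figure 4, although the entries depend on the assumed cost and timing conventions and need not coincide with the figure.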

2.4  Conclusion 

We have considered the defender’s policy optimization problem in the presence of learning effects in a
security context. By formulating the problem as a Markov decision process, we are able to analyze the
characteristics of the optimal policy. If the attacker is persistent, the optimal strategy of the defender is
to keep the attacker in the system in order to out-learn him and eventually thwart the attacks, which is
quite different from the conventional idea of expelling the attacker whenever his presence is detected. The
case in which the attacker may give up can be formulated as a discounted MDP, and we observe through
numerical experiments that the optimal policy in this case has a threshold-like structure.

Our model, although quite stylized, captures the interesting effects that arise when learning is taken into
account in a cyber-defense scenario. It demonstrates the potential benefit of gathering intelligence from the
attacker during the course of a defense. This idea yields a new perspective on network security
problems.


3  Classification 

Our  work  on  classification  can  be  divided  into  two  categories.  Our  work  earlier  in  the  project  looked  at  a 
model  in  which  the  attacker  chooses  a  single  number,  the  rate  at  which  to  attack.  This  work  is  described 
in  Section  3.1.  The  second  category  of  work,  done  later  in  the  project,  supposed  the  attacker  could  pick  a 
distribution  across  “attack  strengths”  (mixed  strategy).  This  work  is  described  in  Section  3.2. 

3.1  Attacker  Chooses  Real  Number  Valued  “Attack  Strength” 

This section summarizes work published in [3]. The model is illustrated in Figure 5. A network defender
faces an attacker that is either a spy or a spammer, with probabilities $p$ and $1 - p$ respectively. The
defender has two servers that can be attacked, a File Server (FS) and a Mail Server (MS). We suppose
that spammers attack the MS most often because they want to send spam and to get the addresses of
potential victims. However, a spammer occasionally hits the FS as he explores the defender’s information
system looking for other potential targets. We suppose time is discrete, and in each period $k$, a spammer
hits the FS with probability $\theta_0 < 1/2$ and otherwise hits the MS. The attacks are assumed to be i.i.d.
Bernoulli from period to period. Moreover, we suppose the defender can observe the sequence of attacks
$z_k \in \{MS, FS\}$. Spammers are assumed to be non-strategic, so $\theta_0$ is taken to be a fixed parameter in the
model.

A spy has to choose the frequency with which to hit the FS, which is what he prefers to attack as that is
where the information he wants is stored. However, he can also choose to hit the MS during some time
periods to make it more difficult for the defender to distinguish him from a spammer. We suppose that
the spy’s strategy is to pick a single probability $\theta_1$ of hitting the FS in any period. Once the spy picks $\theta_1$,
his attacks on the FS are restricted to be Bernoulli. If he picks $\theta_1$ too high, then it will be easy for the
defender to distinguish him from a spammer; if he picks $\theta_1$ too low, then he reduces the frequency with
which he gets to attack the desired target.

The defender has to decide in each period whether to classify the attacker as a spy or a spammer, or to
do nothing and keep observing. When a spammer is attacking, the defender pays a penalty $c_0$ each time
the spammer hits the MS and pays a penalty $F$ for misclassifying a spammer as a spy. When a spy is attacking,
the defender pays a penalty $c_1$ for each hit on the FS, which appears as a payoff $-c_1$ to the spy. If the
defender correctly classifies a spy, the game ends and the spy pays a penalty $L$. However, if the defender
misclassifies a spy as a spammer, we suppose that the spy can then continue to attack with impunity
and thus earns a reward equal to the discounted net present value of an endless stream of FS attacks that
happen with probability $\theta_1$ in each period. This misclassification reward to the spy, like the spy’s
rewards for all preceding FS attacks, appears as a penalty to the defender.

Figure 5: An illustration of the classification game.

In this work we consider two versions of this game. In the first version (Section 3.1.1), we suppose that the
defender’s strategy is to commit to a fixed number of observation periods $N$. Simultaneously, spies pick
$\theta_1$. At the end of $N$ periods, the defender makes the classification decision that minimizes his expected
cost given the observations. We call this the fixed N game. Numerical experiments covering the whole
parameter space indicate that this game has no pure Nash equilibrium.

In the second version (Section 3.1.2), the defender does not commit to an observation period but instead can
decide whether to keep taking observations, depending on what has been observed so far. We call this the dynamic
N game. The defender’s best response to a given $\theta_1$ is similar to the well-known Sequential Probability
Ratio Test (SPRT) [4]. In an SPRT (with binomial data and two hypotheses), one keeps track of the Log
Likelihood Ratio (LLR), which evolves like a one-dimensional random walk as observations arrive, and
makes a classification when the LLR crosses either an upper or a lower threshold. For a given $\theta_1$, the best
response of the defender is to use an SPRT-like test with particular thresholds (that can be numerically
computed) and a hypothesis, written $\hat\theta_1$, that matches the value $\theta_1$ the spy is actually using. If we fix a defender
strategy (SPRT thresholds and hypothesis $\hat\theta_1$), it might be that the spy’s best response is to play a
$\theta_1$ that does not match what the defender is expecting. A Nash equilibrium of the game would be a point
where the attacker’s $\theta_1$ and the defender’s hypothesis $\hat\theta_1$ match. We find that the dynamic N game can exhibit
such a Nash equilibrium, using experiments that leverage available computational tools for Partially Observable
Markov Decision Processes (POMDPs) [5] and other finite-state Markov reward processes [2].

3.1.1  Fixed  N  game 

In the Fixed N game, the defender employs a fixed-sample-size, uniformly most powerful (UMP) test [6].
We start by considering a simple-vs-simple test $H_0 : \theta = \theta_0$ versus $H_1 : \theta = \theta_1$. Recall that the observations
of server hits are modeled as a sequence of i.i.d. Bernoulli random variables conditioned on the true type
of the attacker. Therefore, the likelihood ratio, given a vector of observations $\mathbf{z}_N = (z_1, \ldots, z_N)$, is

$$\Lambda(\mathbf{z}_N) = \frac{P[\mathbf{z}_N \mid X = 1]}{P[\mathbf{z}_N \mid X = 0]} = \left[\frac{\theta_1 (1 - \theta_0)}{(1 - \theta_1)\,\theta_0}\right]^{z} \left[\frac{1 - \theta_1}{1 - \theta_0}\right]^{N},$$

where $z := \sum_{k=1}^{N} z_k$ simply counts the number of FS attacks (taking $z_k = 1$ for an FS hit and $z_k = 0$
for an MS hit). By the Neyman-Pearson lemma [6], a test with decision
rule “reject $H_0$ if $\Lambda(\mathbf{z}_N) > M$” and Type-I error probability $\alpha$ is a level-$\alpha$ UMP test; moreover, it is easy
to check that the condition $\Lambda(\mathbf{z}_N) > M$ is equivalent to $z > m$ for some integer $m$. However, the defender has
no access to the value of $\theta_1$ chosen by a spy in our game, since both players act simultaneously. Therefore,
the defender in fact carries out a one-sided test $H_0 : \theta \le \theta_0$ vs. $H_1 : \theta > \theta_0$. By properly choosing
an alternative hypothesis $H_1' : \theta = \hat\theta_1$, the defender may effectively transform the one-sided test into a
simple-vs-simple one; moreover, by the Karlin-Rubin theorem [6], the aforementioned decision rule still yields
a test with the smallest mis-detection rate among all tests with the desired false-alarm rate.

From the above discussion, we see that the defender’s strategy is a pair of non-negative integers $(N, m)$ such
that $m < N$.
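The count-threshold form of the test can be illustrated with a short sketch. The function names and the example numbers ($N = 20$, $\theta_0 = 0.1$, level 0.05) are assumptions chosen for illustration, not values from the report. The code evaluates the likelihood ratio for a given FS count and finds the smallest integer $m$ whose false-alarm rate $P[z > m \mid \theta_0]$ does not exceed the desired level.

```python
from math import comb

def binom_tail(n, theta, m):
    """P[Z > m] for Z ~ Binomial(n, theta)."""
    return sum(comb(n, z) * theta**z * (1 - theta)**(n - z)
               for z in range(m + 1, n + 1))

def likelihood_ratio(z, n, theta0, theta1):
    """Likelihood ratio for z FS hits out of n i.i.d. Bernoulli observations."""
    return ((theta1 * (1 - theta0)) / ((1 - theta1) * theta0))**z \
        * ((1 - theta1) / (1 - theta0))**n

def smallest_threshold(n, theta0, level):
    """Smallest m such that the false-alarm rate P[Z > m | theta0] <= level."""
    for m in range(n + 1):
        if binom_tail(n, theta0, m) <= level:
            return m
    return n  # unreachable in practice: the tail is 0 at m = n

if __name__ == "__main__":
    m = smallest_threshold(20, 0.1, 0.05)
    print("classify as spy if the FS count exceeds", m)
```

Because the likelihood ratio is increasing in the FS count whenever $\theta_1 > \theta_0$, the threshold on the count does not depend on the particular alternative, which is the point of the Karlin-Rubin argument above.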

To study the existence of a pure Nash equilibrium, we adopt the standard approach of deriving the best response
mappings of both players and checking for intersection point(s). Due to the complexity of our payoff
functions, there are no closed-form expressions for the best responses and we resort to extensive numerical
experiments. It turns out that a pure Nash equilibrium fails to exist in our Fixed N game. For many
problem instances, we find that the attacker’s best response function is discontinuous. In these
examples, when the defender commits to a large $N$, the attacker’s best response is to attack aggressively
by choosing a large $\theta_1$. Conversely, when the defender commits to a small $N$, the attacker’s best
response is to pick a $\theta_1$ close to $\theta_0$ to avoid detection. Because of this discontinuity, the best
response functions of the two players never intersect, and such an intersection is what is needed to have
a Nash equilibrium point.


3.1.2  Dynamic  N  Game 


In this section, we remove the restriction that the defender commits to a fixed observation time. That is,
as in the famous Wald problem [4], the number of observations $N$ before classification depends not just
on the two players’ strategies but also on the particular observation sequence $\mathbf{z}_k = (z_0, z_1, \ldots, z_k)$. The
spy’s problem remains essentially the same as in the preceding section, namely to select how frequently to
hit the file server relative to the mail server (i.e., the value of the probability $\theta_1$). Note that the spy has no
obligation (and, in fact, generally has incentives not) to behave as hypothesized by the defender (i.e., the
spy’s parameter $\theta_1$ may differ from the value $\hat\theta_1$ hypothesized by the defender). The question we seek to
answer is whether it is possible for the defender to hypothesize a value $\hat\theta_1$, and design his best response
strategy accordingly, such that the spy’s best response yields $\theta_1 = \hat\theta_1$.


If the defender were to hypothesize the true value of $\theta_1$, then results for the Wald problem imply that the
defender’s best response function takes the form of a Sequential Probability Ratio Test (SPRT) parametrized
by two probability thresholds we denote by $\eta$ and $\xi > 1 - \eta$; i.e., initialize probability $b_{-1} = p$ and, in each
stage $k = 0, 1, 2, \ldots$, first apply the probabilistic state recursion

$$P[X = 1 \mid \mathbf{z}_k] = b_k = \begin{cases} \dfrac{(1 - \theta_1)\, b_{k-1}}{(1 - \theta_0)(1 - b_{k-1}) + (1 - \theta_1)\, b_{k-1}} & \text{if } z_k = MS \\[2ex] \dfrac{\theta_1 b_{k-1}}{\theta_0 (1 - b_{k-1}) + \theta_1 b_{k-1}} & \text{if } z_k = FS \end{cases}$$

and then choose to classify-spammer if $b_k < 1 - \eta$, to classify-spy if $b_k > \xi$, and to continue otherwise.
Under the assumption that the defender possesses no knowledge of how the spy may play, the only option
is to employ the SPRT strategy for some hypothesis $\hat\theta_1$ on the spy’s strategy. We denote such
hypothesis-dependent SPRT thresholds by $\eta(\hat\theta_1)$ and $\xi(\hat\theta_1)$. We similarly denote the defender’s associated cost
by $J_D(\hat\theta_1 \mid \theta_1)$, also reflecting its dependence on the spy’s true choice of $\theta_1$. Recognizing the defender’s
best response model as a special case of the well-studied Partially Observable Markov Decision Process
(POMDP) [5], we appeal to a publicly available POMDP solver (see http://www.pomdp.org) to both
optimize the SPRT thresholds and compute the defender’s performance $J_D(\hat\theta_1 \mid \hat\theta_1)$ if the hypothesis were
in fact true.
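The belief recursion and the three-way decision can be sketched directly. The thresholds `eta` and `xi` below are arbitrary illustrative numbers, not thresholds optimized by a POMDP solver, and the function names are assumptions.

```python
def update_belief(b_prev, z, theta0, theta1):
    """One Bayes step for b = P[X = 1 | observations]; z is 'FS' or 'MS'."""
    if z == "FS":
        num = theta1 * b_prev
        den = theta0 * (1 - b_prev) + theta1 * b_prev
    else:
        num = (1 - theta1) * b_prev
        den = (1 - theta0) * (1 - b_prev) + (1 - theta1) * b_prev
    return num / den

def sprt(observations, p, theta0, theta1, eta, xi):
    """Run the belief recursion from prior p until a threshold is crossed.

    Returns (decision, stage index, belief). The decision is 'continue'
    if the observations run out before either threshold is crossed.
    """
    b = p
    k = -1
    for k, z in enumerate(observations):
        b = update_belief(b, z, theta0, theta1)
        if b < 1 - eta:
            return "classify-spammer", k, b
        if b > xi:
            return "classify-spy", k, b
    return "continue", k, b

if __name__ == "__main__":
    print(sprt(["FS", "MS", "FS", "FS"], p=0.5, theta0=0.1, theta1=0.3,
               eta=0.9, xi=0.95))
```

With $\theta_0 = 0.1$ and $\theta_1 = 0.3$, each FS hit multiplies the posterior odds of a spy by 3, while each MS hit multiplies them by $0.7/0.9$, so FS observations are far more informative than MS observations.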


The spy’s best response considers the defender’s strategy as given, i.e., the hypothesis $\hat\theta_1$ and the associated
SPRT thresholds $\eta(\hat\theta_1)$ and $\xi(\hat\theta_1)$ are known to the spy. In turn, for any choice of the true $\theta_1$, denote the
spy’s associated cost by $J_A(\hat\theta_1 \mid \theta_1)$. It follows that the spy’s best response is the value of $\theta_1 \in (\theta_0, 1]$ that
minimizes $J_A(\hat\theta_1 \mid \theta_1)$, or equivalently the value of $\theta_1$ that maximizes the incentive $J_A(\hat\theta_1 \mid \hat\theta_1) - J_A(\hat\theta_1 \mid \theta_1)$
to deviate from the defender’s hypothesis $\hat\theta_1$. Our computation of the spy’s response relies on a nonlinear
program, each iteration on a candidate value for $\theta_1$ involving the construction and solution of a
finite-state Markov chain that exploits two properties of the defender’s SPRT strategy. Firstly, the defender’s
probabilistic state recursion can (until classification) be equated with a random walk along the real line
involving the defender’s log-likelihood ratio (LLR)

$$R_k = \log \frac{P[\mathbf{z}_k \mid X = 1]}{P[\mathbf{z}_k \mid X = 0]} = \begin{cases} R_{k-1} + \log \dfrac{1 - \hat\theta_1}{1 - \theta_0} & \text{if } z_k = MS \\[2ex] R_{k-1} + \log \dfrac{\hat\theta_1}{\theta_0} & \text{if } z_k = FS \end{cases}$$

starting from the origin $R_{-1} = 0$. In turn, the SPRT thresholds (and prior probability $p$) determine the
segments of the real line corresponding to the three control actions available to the defender, i.e., choose
to classify-spammer if $R_k < \log \frac{(1 - \eta)(1 - p)}{\eta\, p}$, to classify-spy if $R_k > \log \frac{\xi (1 - p)}{(1 - \xi)\, p}$, and to
continue otherwise.
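The correspondence between the belief thresholds and the LLR thresholds follows from the odds form of Bayes’ rule, $b_k / (1 - b_k) = \frac{p}{1 - p} e^{R_k}$. The sketch below checks this numerically. The function names and numerical values are illustrative assumptions, and `theta1_hat` denotes the defender’s hypothesized value.

```python
from math import exp, log

def llr_step(r_prev, z, theta0, theta1_hat):
    """Advance the defender's log-likelihood ratio by one observation."""
    if z == "FS":
        return r_prev + log(theta1_hat / theta0)
    return r_prev + log((1 - theta1_hat) / (1 - theta0))

def llr_thresholds(p, eta, xi):
    """Map belief thresholds (1 - eta, xi) to LLR thresholds via the prior p."""
    lower = log((1 - eta) * (1 - p) / (eta * p))
    upper = log(xi * (1 - p) / ((1 - xi) * p))
    return lower, upper

def belief_from_llr(r, p):
    """Recover the posterior P[X = 1] from the LLR and the prior p."""
    odds = p / (1 - p) * exp(r)
    return odds / (1 + odds)

if __name__ == "__main__":
    lower, upper = llr_thresholds(p=0.5, eta=0.9, xi=0.95)
    r = 0.0
    for z in ["FS", "MS", "FS", "FS"]:
        r = llr_step(r, z, theta0=0.1, theta1_hat=0.3)
        print(z, round(r, 4), round(belief_from_llr(r, 0.5), 4),
              lower < r < upper)
```

Working on the LLR scale turns the belief update into a walk with two fixed increments, which is what makes a uniform quantization of the continue region natural.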

Secondly, the spy’s strategy $\theta_1$ alters the statistics of this random walk, lower (higher) values increasing
the chances that the LLR first exits the continue region at the lower (upper) end of the real line. The
Markov chain representation involves $Q + 3$ states, $Q$ of them indexing the levels of a uniform quantization
of the LLR continue region, one indexing an initial state and two indexing terminal states (one per classify
decision). The transition probabilities reflect not only the spy’s response $\theta_1$ and the increments $R_k - R_{k-1}$
of the defender’s LLR walk, but also the noise introduced by the quantization. The transition costs
reflect the spy’s rewards from file-server attacks and evading detection, as well as the spy’s cost of actual
detection. Then, in each iteration of the nonlinear program, standard techniques for Markov chains [2]
can be employed to approximate the expected total discounted cost when $\theta_1$ is not necessarily equal to $\hat\theta_1$.
We omit further details here, but the outputs of this method are a particular value for the spy’s policy
parameter $\theta_1 \in (\theta_0, 1]$ and the spy’s associated cost $J_A(\hat\theta_1 \mid \theta_1)$.

For the game-aware defender, the key question is whether there exists a hypothesis $\hat\theta_1$ from which the spy
has no incentive to deviate, choosing $\theta_1 = \hat\theta_1$. Fig. 6 illustrates an empirical solution obtained via the
computational methods and approximations discussed above, where for each hypothesis $\hat\theta_1$ we

1.  employ the POMDP solver to obtain the defender’s (a) SPRT thresholds $\eta(\hat\theta_1)$ and $\xi(\hat\theta_1)$ as well
as (b) penalty $J_D(\hat\theta_1 \mid \hat\theta_1)$; then

2.  employ the nonlinear program to obtain the spy’s (c) true $\theta_1$ and (d) the penalty $J_A(\hat\theta_1 \mid \theta_1)$ (where we
also show its comparison to the penalty $J_A(\hat\theta_1 \mid \hat\theta_1)$ were the spy to behave as the defender hypothesizes).

[Figure 6 panels: (a) Defender’s Response, (b) Defender’s Performance, (c) Spy’s Response, (d) Spy’s Performance]

Figure 6: Equilibrium solution to the dynamic N game with $p = 0.5$, $\theta_0 = 0.1$, $\delta = 0.95$, $c_0 = 0.01$, $c_1 = 1$
and $L = F = 50$. Each marked point on the defender’s response and performance curves (plots (a) and
(b), respectively) is obtained via the POMDP solver, while each marked point on the spy’s response and
performance curves (plots (c) and (d), respectively) is obtained via the nonlinear program iterating on a
quantized approximation (with $Q = 100$) of the defender’s SPRT solution. The equilibrium point is where
the spy’s response curve in (c) intersects the $\theta_1 = \hat\theta_1$ line.

Our procedure starts with values for $\hat\theta_1$ over the whole interval $(\theta_0, 1]$ in increments of 0.1. Then, it identifies
each sub-interval in which, under the assumption of continuity, there may lie a point at which $\theta_1 = \hat\theta_1$.
The procedure continues with values for $\hat\theta_1$ over each such sub-interval in increments of 0.01, and so on.
The procedure terminates at a pre-specified precision, which in Fig. 6 was set to 0.001. The markers on
each curve in Fig. 6 denote the selected values of $\hat\theta_1$ over the entire procedure. The solution point in each
plot corresponds to the value of $\hat\theta_1$ at which the spy’s response $\theta_1$ in Fig. 6(c) is nearest to the $\theta_1 = \hat\theta_1$ line.
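The successive-refinement search can be sketched generically. In the sketch below, `response` stands in for the spy’s best response map (which in the report comes from the nonlinear program and is not reproduced here), and the toy linear response used in the example is purely hypothetical. The grid also includes the left endpoint for simplicity, although the interval in the text is open at $\theta_0$.

```python
def refine_fixed_point(response, lo, hi, steps=(0.1, 0.01, 0.001)):
    """Search for a hypothesis t with response(t) nearest to t.

    At each resolution, evaluate the gap response(t) - t on a grid, keep the
    sub-intervals where the gap changes sign (a crossing may lie there if the
    response is continuous), and refine those at the next resolution.
    """
    intervals = [(lo, hi)]
    best = None  # (hypothesis, gap) with the smallest |gap| seen so far
    for step in steps:
        next_intervals = []
        for a, b in intervals:
            n = max(1, round((b - a) / step))
            grid = [a + (b - a) * i / n for i in range(n + 1)]
            gaps = [response(t) - t for t in grid]
            for t, g in zip(grid, gaps):
                if best is None or abs(g) < abs(best[1]):
                    best = (t, g)
            for i in range(n):
                if gaps[i] * gaps[i + 1] <= 0:
                    next_intervals.append((grid[i], grid[i + 1]))
        if not next_intervals:
            break  # no sign change found at this resolution
        intervals = next_intervals
    return best[0]

if __name__ == "__main__":
    # Hypothetical decreasing response, for illustration only.
    toy_response = lambda t: 0.4 - 0.5 * t
    print(round(refine_fixed_point(toy_response, 0.1, 1.0), 3))
```

Each evaluation of the real response map is expensive (a POMDP solve plus a nonlinear program), so refining only the sub-intervals that bracket a crossing keeps the number of evaluations small compared with a uniform grid at the final precision.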

Observe that the two players’ response functions exhibit a smooth confusion versus exploitation trade-off.
That is, for hypotheses $\hat\theta_1$ close to $\theta_0$, the defender’s SPRT thresholds are such that spammer classifications
are made frequently and nearly immediately: in other words, when the defender anticipates a spy favoring
confusion, his strategy reduces to near-immediate expulsion of either type of attacker and, in turn, the
spy’s response is to hit the FS at every opportunity. For hypotheses $\hat\theta_1$ away from $\theta_0$, the defender’s SPRT
thresholds are such that classification is deferred: in other words, when the defender anticipates a spy
favoring exploitation, his strategy allows for the time to reliably classify either type of attacker and, in
turn, the spy’s response is to evade detection by hitting the FS almost as infrequently as spammers do.
For the model parameters in Fig. 6, the equilibrium point of $\hat\theta_1 \approx 0.152$ neutralizes all incentive for the
spy to either confuse an exploitation-oriented defense or to exploit a confusion-oriented defense.

3.1.3  Conclusion 

In this part of the project, we developed a security classification game. The defender tries to classify
the attackers (spammer or spy) effectively while controlling the damage incurred during the period of
gathering evidence. A strategic spy faces a trade-off between (i) exploiting the defender’s observation time
by attacking aggressively and (ii) confusing the defender by mixing attacks, thereby enjoying the benefits of
mis-detection. The non-existence of a pure Nash equilibrium in our fixed-N game suggests that an
over-simplified strategy adopted by the defender will never lead the game to settle at a stable point where
both players behave predictably. This problem is mitigated by allowing the defender to make decisions at each
period of time in our dynamic-N game, which essentially removes the spy’s incentive to shift drastically from
aggressive exploitation to moderate confusion.

3.2  Attacker  Chooses  Mixing  Distribution 

The  results  of  this  section  are  described  in  more  detail  in  [7]. 

3.2.1  Basic  Model 

The game model is as follows. As in the earlier family of models, Nature decides the type of an attacker in
a network: spy or spammer with probabilities $p$ and $1 - p$ respectively. The network consists of a defender
and two servers that might be attacked: a File Server (FS) with sensitive data and a Mail Server (MS) with
contents of inferior importance. The spy’s goal is to attack the FS as frequently as possible while evading
detection, and the spammer’s goal is to attack the MS to congest the network or annoy the defender.
The defender is a strategic player who monitors the two types of servers at each time slot (we consider
discrete time). We assume a constant classification window of $N$ time slots, during which the defender
observes the number of attacks on the FS. The spammer is a non-strategic player who attacks the FS in
$S$ time slots, where $S$ has a known cumulative distribution function. For instance, he can be modeled as
hitting the FS in each time slot according to a Bernoulli distribution with a small per-period probability
$\theta_0$. For a fixed observation window of $N$ time slots, the defender selects the threshold $T$, a number of
time slots, below which he classifies the attacker as a spammer, and the spy selects the number of FS attacks
$H$ to launch.

Attacker’s cost function: The spy is detected when the defender’s threshold $T$ is smaller than or equal to
the spy’s selection of $H$ (the number of FS attacks). In this case, he incurs a cost of $c_d$. We assume that
each FS hit gives the spy some benefit captured by the parameter $c_a$. We also assume that he gains nothing
from attacking the MS. His overall gain from the attacks is proportional to the number of time slots $H$ he
selected to attack. Since it will be useful to work with a cost function for the attacker rather than a payoff
function, we subtract the gain from the attacks. Thus, his overall cost function can be expressed as follows

$$J_A(T, H) = c_d \cdot \mathbb{1}_{T \le H} - c_a \cdot H,$$

where $\mathbb{1}_{T \le H}$ is 1 if $T \le H$ and 0 otherwise.

Defender’s reward function: The defender’s expected reward function depends on the true type of the
attacker. In the case that he faces a spy (which happens with probability $p$), he makes a correct
classification and gains $c_d$ when his threshold $T \le H$. He always incurs a cost from the FS attacks that is
proportional to $H$. With probability $1 - p$ he faces a spammer who attacks in $S$ time slots. For a fixed $T$,
we denote by $\phi(T) = \Pr\{S \ge T\}$ the probability that the spammer attacks the FS at least $T$ times. Then,
the defender has an expected false alarm penalty of $c_{fa} \cdot \phi(T)$ and his total expected payoff is

$$U_D(T, H) = p \cdot (c_d \cdot \mathbb{1}_{T \le H} - c_a \cdot H) - (1 - p) \cdot c_{fa} \cdot \phi(T).$$

By scaling the above function by $1/p$, we finally get

$$U_D(T, H) = c_d \cdot \mathbb{1}_{T \le H} - c_a \cdot H - \mu(T),$$

where $\mu(T) = \frac{1 - p}{p} \cdot c_{fa} \cdot \phi(T)$. We assume that $\phi(T)$ is strictly decreasing with $T$.
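
As a minimal sketch, the two payoff functions can be evaluated directly. The code assumes the Bernoulli spammer mentioned above with independent time slots, so that $S$ is Binomial($N$, $\theta_0$); the function names are hypothetical, detection is taken to occur when $T \le H$, and the parameter values are those quoted in the caption of Figure 7.

```python
from math import comb


def phi(T, N, theta0):
    """Pr{S >= T} for S ~ Binomial(N, theta0) (assumed spammer model)."""
    return sum(comb(N, k) * theta0**k * (1 - theta0)**(N - k)
               for k in range(T, N + 1))


def mu(T, N, theta0, p, c_fa):
    """Scaled false-alarm penalty: (1 - p) / p * c_fa * phi(T)."""
    return (1 - p) / p * c_fa * phi(T, N, theta0)


def attacker_cost(T, H, c_d, c_a):
    """Spy's cost: detection cost if T <= H, minus the gain from H hits."""
    return c_d * (T <= H) - c_a * H


def defender_payoff(T, H, N, theta0, p, c_d, c_a, c_fa):
    """Scaled defender payoff: the spy's cost minus the false-alarm term."""
    return attacker_cost(T, H, c_d, c_a) - mu(T, N, theta0, p, c_fa)


N, theta0, p, c_d, c_a, c_fa = 7, 0.1, 0.2, 15, 1, 23
for T in range(N + 2):
    print(T, round(phi(T, N, theta0), 6), round(mu(T, N, theta0, p, c_fa), 4))
```

Under this assumption $\phi(0) = 1$ and $\phi(N + 1) = 0$, and $\phi$ is strictly decreasing in between, which matches the assumption stated in the text.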

3.2.2  Players’  interactions 

For a fixed classification window $N$ the spy has $N + 1$ available actions: attack the file server
$H \in \{0, \ldots, N\}$ times, whereas the defender has $N + 2$ available actions: select $T \in \{0, \ldots, N + 1\}$
as the classification threshold. A threshold of 0 always results in spy classification (as any intruder will
attack the FS at least 0 times), and a threshold of $N + 1$ always results in spammer classification.

We model our problem as a nonzero-sum game, where the term in the defender’s payoff that differs from
the spy’s cost depends only on the defender’s strategy. In the literature these games are known as
almost zero-sum games or quasi zero-sum games. We are interested in Nash equilibria in mixed strategies
for the following reason. On the one side, the spy seeks to select a number of attacks just below the
defender’s threshold. On the other side, the defender aims to select a threshold equal to the attacker’s
strategy. Thus the players need to mix between different strategies to make themselves less predictable.
The spy chooses a distribution $\alpha$ on the available numbers of FS hits; thus $\alpha$ is a vector of size $N + 1$
with nonnegative elements that sum to 1. Similarly, the defender chooses a distribution $\beta$ on the collection
of possible thresholds $T$. Thus $\beta$ is a vector of size $N + 2$.
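
The need for mixing can be checked by brute force on a small instance. The sketch below enumerates all pure strategy pairs and keeps those from which neither player can profitably deviate. It assumes a Binomial spammer and detection when $T \le H$; the function names are hypothetical, and the parameters are those quoted for Figure 7.

```python
from math import comb


def phi(T, N, theta0):
    # Pr{S >= T} for S ~ Binomial(N, theta0): the Bernoulli spammer assumed here.
    return sum(comb(N, k) * theta0**k * (1 - theta0)**(N - k)
               for k in range(T, N + 1))


def pure_nash_equilibria(N, theta0, p, c_d, c_a, c_fa):
    """List every pure pair (T, H) from which neither player gains by deviating."""
    mu = [(1 - p) / p * c_fa * phi(T, N, theta0) for T in range(N + 2)]

    def cost(T, H):        # spy's cost
        return c_d * (T <= H) - c_a * H

    def payoff(T, H):      # defender's scaled payoff
        return cost(T, H) - mu[T]

    equilibria = []
    for T in range(N + 2):
        for H in range(N + 1):
            spy_ok = all(cost(T, H) <= cost(T, h) for h in range(N + 1))
            def_ok = all(payoff(T, H) >= payoff(t, H) for t in range(N + 2))
            if spy_ok and def_ok:
                equilibria.append((T, H))
    return equilibria


print(pure_nash_equilibria(N=7, theta0=0.1, p=0.2, c_d=15, c_a=1, c_fa=23))  # prints []
```

For these parameters the list is empty: against any fixed threshold the spy attacks just below it (or all-out when detection is unavoidable), and the defender then prefers a different threshold.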

3.2.3  Game-Theoretic  Analysis 

In  this  section,  we  state  our  main  theorem.  We  use  the  notation  “min”  when  we  find  the  minimum  element 
of  a  vector,  and  “minimize”  when  we  minimize  a  specific  expression  over  some  constraints.  We  use  the 
superscript  T  for  matrix  transposition. 

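Let $A$ be an $(N + 1) \times (N + 2)$ matrix representing the spy’s cost for each of his strategies against
every possible strategy of the defender. Rows are indexed by $H \in \{0, \ldots, N\}$ and columns by
$T \in \{0, \ldots, N + 1\}$, so that, by the definition of $J_A$, $A$ can be written in the following form:

$$A = c_d \cdot \begin{pmatrix} 1 & 0 & \cdots & 0 & 0 \\ 1 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 1 & 1 & \cdots & 1 & 0 \end{pmatrix} - c_a \cdot \begin{pmatrix} 0 & 0 & \cdots & 0 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ N & N & \cdots & N \end{pmatrix}$$

The last all-zero column in the first component of $A$ captures that the spy is never caught when the defender
chooses the $N + 1$ threshold. With $A$ defined as above, the attacker cost can be written as $\alpha^T A \beta$ and
the defender payoff can be written as $\alpha^T A \beta - \mu^T \beta$, where $\mu$ is the vector with entries $\mu(T)$. It
will turn out that certain computations are simplified by using a matrix with only positive entries. We
therefore shift $A$ by the constant $N c_a + \epsilon$, with $\epsilon > 0$, and define

$$\tilde{A} = A + (N c_a + \epsilon) \cdot \mathbf{1}_{(N+1) \times (N+2)}$$

where $\mathbf{1}_{(N+1) \times (N+2)}$ is a matrix of all ones of dimension $(N + 1) \times (N + 2)$. Since $\alpha$ and $\beta$
must each sum to 1, the expressions $\alpha^T \tilde{A} \beta$ and $\alpha^T \tilde{A} \beta - \mu^T \beta$ are respectively the
attacker cost and defender payoff shifted by a constant. Since adding a constant to a player’s payoff does not
affect his best responses, from here on we will consider these expressions to be the payoff functions of each
player.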
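
The construction can be illustrated with plain Python lists. This is a sketch rather than the authors’ code: the helper names are hypothetical, entries are taken to be $c_d \cdot \mathbb{1}_{T \le H} - c_a \cdot H$, and the small parameter values are chosen only to keep the output readable.

```python
def cost_matrices(N, c_d, c_a, eps=1.0):
    """Return (A, A_shifted); rows are H = 0..N, columns are T = 0..N+1."""
    A = [[c_d * (T <= H) - c_a * H for T in range(N + 2)] for H in range(N + 1)]
    shift = N * c_a + eps            # makes every entry strictly positive
    A_shifted = [[a + shift for a in row] for row in A]
    return A, A_shifted


def bilinear(alpha, M, beta):
    """Compute alpha^T M beta for plain Python lists."""
    return sum(alpha[i] * M[i][j] * beta[j]
               for i in range(len(alpha)) for j in range(len(beta)))


A, A_shifted = cost_matrices(N=2, c_d=3, c_a=1, eps=1.0)
for row in A:
    print(row)

# Any pair of distributions sees the same constant offset N * c_a + eps = 3.
alpha = [0.2, 0.3, 0.5]
beta = [0.1, 0.2, 0.3, 0.4]
print(bilinear(alpha, A_shifted, beta) - bilinear(alpha, A, beta))
```

Because the offset is the same for every pair of distributions, best responses computed with the shifted matrix coincide with those computed with the original one.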

For a given defender strategy $\beta$, the minimum attacker cost is achieved by putting positive probability
only on strategies corresponding to the minimum entries of the vector $\tilde{A}\beta$. Such a strategy results in an
attacker cost of $\min[\tilde{A}\beta]$, where min extracts the minimum element of the vector. The defender’s payoff
when the attacker plays a best response is

$$\theta(\beta) = \min[\tilde{A}\beta] - \mu^T \beta.$$

This function is important for our subsequent analysis. Since it is a measure of how “good” a strategy $\beta$
is, we refer to $\theta(\beta)$ as the defendability of $\beta$. This is similar to the concept of “vulnerability”
developed in [8].
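
A short sketch of how defendability could be evaluated for a candidate defender mix follows. The function name is hypothetical, and the toy instance (a shifted cost matrix for $N = 2$, $c_d = 3$, $c_a = 1$, $\epsilon = 1$, with made-up false-alarm values) is an illustration only, not data from the report.

```python
def defendability(A_shifted, mu, beta):
    """Return (theta, best_responses) for a defender mix beta."""
    # Spy's expected cost for each pure number of FS hits H.
    spy_costs = [sum(a * b for a, b in zip(row, beta)) for row in A_shifted]
    lowest = min(spy_costs)
    best_responses = [H for H, c in enumerate(spy_costs) if abs(c - lowest) < 1e-12]
    false_alarm = sum(m * b for m, b in zip(mu, beta))
    return lowest - false_alarm, best_responses


# Hypothetical toy instance: N = 2, c_d = 3, c_a = 1, eps = 1, made-up mu values.
A_shifted = [[6, 3, 3, 3],
             [5, 5, 2, 2],
             [4, 4, 4, 1]]
mu = [4.0, 1.0, 0.5, 0.0]

print(defendability(A_shifted, mu, [0, 0, 1, 0]))   # pure threshold T = 2 -> (1.5, [1])
print(defendability(A_shifted, mu, [0.25] * 4))     # uniform mix -> (1.875, [2])
```

In this toy case the uniform mix has higher defendability than the pure threshold, which illustrates why the defender may prefer to randomize.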

Lemma 1 In NE, the defender strategy $\beta$ must maximize $\theta(\beta)$.

Proof 1 (Proof Sketch) The minimum cost the attacker can achieve in response to $\beta$ is $\delta := \min[\tilde{A}\beta]$.
In Nash Equilibrium, the attacker must be playing a best response and the defender must not be able to
improve payoff with a unilateral deviation. The attacker’s optimization problem, subject to the constraint
that he pick a strategy that makes the defender unable to improve payoff from a unilateral deviation, takes
the form

$$\begin{aligned} \underset{\alpha}{\text{minimize}} \quad & \beta^T \tilde{A}^T \alpha \\ \text{subject to} \quad & \alpha \ge 0, \quad \mathbf{1}^T \alpha \ge 1, \\ & \tilde{A}^T \alpha - \mu \le \theta(\beta) \mathbf{1}. \end{aligned}$$

The optimal value of this problem needs to be $\delta$, since if it were more than $\delta$, the attacker would not be
achieving the minimum possible cost. However, analysis of the dual of this program shows that the problem
yields an optimal value of $\delta$ only if $\beta$ is a maximizer of the function $\theta(\beta)$. The details of the dual
program analysis are omitted here due to space constraints.

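In NE the defender maximizes defendability, or equivalently he picks a solution of the following LP:

$$\begin{aligned} \underset{\beta, z}{\text{maximize}} \quad & -\mu^T \beta + z \\ \text{subject to} \quad & z \mathbf{1} \le \tilde{A}\beta \qquad (1) \\ & \mathbf{1}^T \beta = 1, \quad \beta \ge 0. \end{aligned}$$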

Table 1: Defender’s strategy in NE ($\beta_m = c_a/c_d$)

Figure 7: Players’ best responses in NE for $N = 7$, $\theta_0 = 0.1$, $c_d = 15$, $c_a = 1$, $c_{fa} = 23$, $p = 0.2$.

As we can see from the LP, the defendability is maximized at one of the extreme points of the polyhedron
defined by $\tilde{A}x \ge \mathbf{1}$. Given an extreme point $x$, the corresponding distribution is $\beta = x / \|x\|$.
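
One way to see the link between the LP and this polyhedron is the change of variables below. It is a sketch under assumptions not spelled out in the text: $x = \beta / z$ with $z > 0$ (which holds at the optimum because all entries of the shifted matrix are positive), and $\|x\|$ read as the sum of the entries of $x$, which is what makes $\beta$ sum to 1.

```latex
% Assumed substitution: x = \beta / z, with z > 0 and x \ge 0.
\begin{align*}
z\mathbf{1} \le \tilde{A}\beta
  &\iff \tilde{A}x \ge \mathbf{1}, \\
\mathbf{1}^T \beta = 1
  &\iff z = \frac{1}{\mathbf{1}^T x}, \qquad \beta = \frac{x}{\mathbf{1}^T x}, \\
-\mu^T \beta + z
  &= \frac{1 - \mu^T x}{\mathbf{1}^T x}.
\end{align*}
```

Under this substitution the objective becomes a ratio of two affine functions of $x$, which is consistent with the maximum being attained at an extreme point of the polyhedron.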

Theorem 1 In any Nash equilibrium the defender’s strategy $\beta$ maximizes the defendability. A maximizing
value of $\beta$ exists amongst one of the two forms in Table 1 for some $s$. If there is only one maximizing $\beta$
amongst vectors of the form in Table 1, then the Nash equilibrium is unique.

The theorem is proved by showing that an extreme point vector that corresponds to a maximizing
distribution vector of defendability has certain properties. Most importantly, there needs to be one contiguous
block of tight inequalities in the system $\tilde{A}x \ge \mathbf{1}$. Using that fact, one can show that if $s$ and $f$ are
the start and finish indices of the contiguous block, then $\beta_{s+1}$ through $\beta_f$ need to equal $c_a/c_d$. Other
properties can be used to show that $f$ must either be $N$ or $N + 1$.
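
The value $c_a/c_d$ for the middle weights can be motivated by a short calculation. This is a sketch that assumes the matrix entries are $c_d \cdot \mathbb{1}_{T \le H} - c_a \cdot H$ plus the constant shift; it is not the authors’ full proof.

```latex
% Row H of \tilde{A}\beta, using \sum_T \beta_T = 1 for the constant shift:
\begin{align*}
[\tilde{A}\beta]_H &= c_d \sum_{T=0}^{H} \beta_T - c_a H + (N c_a + \epsilon), \\
[\tilde{A}\beta]_{H+1} - [\tilde{A}\beta]_H &= c_d \, \beta_{H+1} - c_a .
\end{align*}
% If rows H and H+1 are both tight (both attain the minimum), the difference
% is zero, which forces \beta_{H+1} = c_a / c_d.
```

In words, the spy is indifferent between attacking $H$ and $H + 1$ times exactly when the extra detection risk $c_d \beta_{H+1}$ offsets the extra gain $c_a$.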

3.2.4  Numerical  Results/Simulations 

We conducted various experiments for different sets of parameters $N$, $c_a$, $c_d$ and $p$, assuming that the
spammer attacks with a Bernoulli distribution with parameter $\theta_0$. We first used the methods discussed
above to calculate the strategies of both players at equilibrium. We then validated our theoretical results
using the Gambit software [9]. We present here two characteristic examples to illustrate the two possible
structures of the Nash equilibria (the two aforementioned cases).

Figure 7 illustrates Case 1 and the unique Nash equilibrium for $N = 7$ time slots. As we can see, all the
middle points are given the same weight $x_m = c_a/c_d = 0.0667$, $x_s = 0$ and $x_{f+1} > x_m$. The structure of
the equilibrium is given by the first row of Table 1, with $s = 1$ and $f = 7$.

Figure 8 presents the unique Nash equilibrium for $N = 7$ in Case 2. As we can see, again all the middle
points are given the same weight $x_m = c_a/c_d = 0.1$, but here $x_s > x_m$ and $x_{f+1} = 0$. Note that as $p$
increases, larger weight is given to the smallest threshold, in order to detect the spy, who is now more
likely to be present. We also observe that the defender still puts some weight on the larger thresholds and
is not focused on a range around $N\theta_0$. This can be explained by the strictly decreasing false alarm cost
function $\mu$: the defender always has an incentive to use larger thresholds to increase his expected payoff.
Figure 8: Players’ best responses in NE for $N = 7$, $\theta_0 = 0.1$, $c_d = 10$, $c_a = 1$, $c_{fa} = 10$ (the value of $p$ is not legible in the source).


Figure 9: Numerical results of revised model with forensics. (a) Model 1: the NE defender payoff first
decreases and then increases in $p$. (b) Comparison of Models 1-2. Both panels use $N = 5$, $c_a = 1$,
$c_d = 10$, $c_{fa} = 10$, $\theta_0 = 0.1$.


3.2.5  Model  Extension  to  Study  Value  of  Forensics 

We have extended this model to study a situation in which, if the spy is detected, the cost to him is
proportional to how hard he attacked. The idea here is that if he attacked harder, the defender will
have more evidence to analyze to learn about the attacker. By comparing the equilibrium payoffs with
and without this feature, one can get a measure of the value to an organization of investing in forensics
capabilities (since without these capabilities one could not use the evidence left by the attacker against
him). This study is detailed in our paper [10]. This work also generalizes the results of this spy-defender
game to apply to more general payoff functions than those described here. Figure 9 illustrates some
numerical results from this study. The first panel shows the expected values of the threshold and the attack
strength $H$, and the equilibrium defender payoff for the revised model, as a function of the prior probability
$p$ that the attacker is a spy. The second panel compares the defender payoff in the revised model (with
forensics) to the older model (without).


3.2.6  Quasi  Zero-Sum  Games 

In this work, the game model has the feature that it looks “almost” like a zero-sum game. The defender’s
payoff is the opposite of the spy’s, plus an extra term that only depends on the defender’s action. We call
this structure a “quasi zero-sum game.” Our work on this model shows that equilibria of quasi zero-sum
games can be found by solving a Linear Program (LP), just as true zero-sum games are widely known to
be solvable with an LP. This finding is important because quasi zero-sum games can potentially model a
wide range of practical problems in the information security domain. We are currently preparing a paper
discussing this finding and planning to submit it to an operations research journal.


4  Conclusions  and  Impact 

In  this  project  we  have  demonstrated  the  potential  value  of  not  always  expelling  attackers  detected  in  the 
system.  This  important  observation  has  potential  impact  in  real-world  applications.  We  have  also  studied 
the  problem  of  classifying  attackers  that  are  trying  to  evade  classification.  The  closed-loop  behavior  of 
attackers  trying  to  remain  below  a  detection  threshold  results  in  Nash  Equilibria  being  mixed  in  most 
situations.  This  qualitative  finding  suggests  that  developers  of  security  software  doing  classification  or 
intrusion  detection  should  consider  using  randomized  thresholds.  The  same  observation  may  apply  to  spam 
filtering  software.  The  qualitative  findings  from  this  work,  we  hope,  will  have  an  impact  on  developers  of 
such  systems.  However,  there  are  many  unanswered  questions  that  these  findings  lead  to,  such  as  how  can 
a  designer  of  security  software  choose  the  right  randomization  strategy?  Questions  like  this  are  likely  to 
be  a  topic  of  future  research  for  us,  and  perhaps  others  in  the  research  community. 

Our results on quasi zero-sum games also have considerable potential impact. A large number of practical
situations are modeled much more accurately by a quasi zero-sum game than by a true zero-sum game.
The ability to efficiently find equilibria of this broader class of games may enable researchers to build
and analyze a broader class of models.